- Newsgroups: comp.sources.misc
- From: Martin Fouts <fouts@clipper.ingr.com>
- Subject: v21i038: cloops - Livermore Loops in C, Part03/03
- Message-ID: <1991Jul25.020740.28375@sparky.IMD.Sterling.COM>
- X-Md4-Signature: 66bbb34c53801c38f1ef1b86799174b7
- Date: Thu, 25 Jul 1991 02:07:40 GMT
- Approved: kent@sparky.imd.sterling.com
-
- Submitted-by: Martin Fouts <fouts@clipper.ingr.com>
- Posting-number: Volume 21, Issue 38
- Archive-name: cloops/part03
- Environment: Cray2, Alliant, Convex, Amdahl, Sun, SGI
-
- ----- cut here for Part03
- #! /bin/sh
- # This is a shell archive. Remove anything before this line, then unpack
- # it by saving it into a file and typing "sh file". To overwrite existing
- # files, type "sh file -c". You can also feed this as standard input via
- # unshar, or by typing "sh <file", e.g.. If this archive is complete, you
- # will see the following message at the end:
- # "End of archive 3 (of 3)."
- # Contents: cloops.tex
- # Wrapped by fouts@bozeman on Sun Jul 21 18:23:21 1991
- PATH=/bin:/usr/bin:/usr/ucb ; export PATH
- if test -f 'cloops.tex' -a "${1}" != "-c" ; then
- echo shar: Will not clobber existing file \"'cloops.tex'\"
- else
- echo shar: Extracting \"'cloops.tex'\" \(28304 characters\)
- sed "s/^X//" >'cloops.tex' <<'END_OF_FILE'
- X% C-Loops
- X%
- X% A Paper Describing the C version of the Livermore Loops
- X% Copyright (C) 1991 Martin Fouts
- X%
- X% This program is free software; you can redistribute it and/or modify
- X% it under the terms of the GNU General Public License as published by
- X% the Free Software Foundation; either version 1, or (at your option)
- X% any later version.
- X%
- X% This program is distributed in the hope that it will be useful,
- X% but WITHOUT ANY WARRANTY; without even the implied warranty of
- X% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- X% GNU General Public License for more details.
- X%
- X% You should have received a copy of the GNU General Public License
- X% along with this program; if not, write to the Free Software
- X% Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
- X%
- X
- X\documentstyle[12pt,twoside,fleqn,openbib]{article}
- X\pagestyle{myheadings}
- X\markboth{DRAFT of \today}{DRAFT of \today}
- X
- X\title{ The Livermore Loops in C
- X\\ DRAFT of \today }
- X
- X\author{
- XMartin Fouts \\
- XMathematician \\
- XNAS Systems Development Branch \\
- XNASA Ames Research Center}
- X
- X
- X\begin{document}
- X\maketitle
- X%\tableofcontents
- X%\vfill
- X%\pagebreak
- X\markboth{DRAFT of \today}{DRAFT of \today}
- X
- X\begin{abstract}
- X
- XThe Livermore Loops were transliterated from Fortran into C. The
- Xresulting program was run on several machines which support both a C
- Xand Fortran compiler. The effort taken in porting the loops is
- Xdiscussed, and the results are presented. Some of the more
- Xinteresting results are analyzed. Problems which arise from
- Xusing the Loops as a benchmark suite are discussed. Some conclusions
- Xabout numerical computation in C are drawn.
- X
- X\end{abstract}
- X
- X\section{Background}
- X
- XNumerical calculations, especially those involving vector operations,
- Xhave long been the territory of Fortran, and it is widely accepted
- Xthat no other high level language offers the same opportunity for
- Xcompiler optimization of source code.
- X
- XThe Livermore Loops\cite{McMahon86} are widely used as a benchmark of
- Xnumerically intensive calculation in Fortran, and results are available
- Xfor a wide range of machines. The loops were transliterated from
- XFortran into C and results were gathered for several machines. On
- Xall machines the C compiler produces faster overall performance.
- XBoth languages calculated the same results.
- X
- XIn performing the conversion, some difficulties using the Livermore
- XLoops as a benchmark suite were uncovered. They include the ease with
- Xwhich some compilers optimize away timing loops and the lack of
- Xsensitivity in the validation routines. These difficulties are
- Xdiscussed in the sections which follow.
- X
- XSome semantic differences between C and Fortran were uncovered which
- Xhave an effect on the precision of the results generated by the
- Xprogram on certain machines. These differences are discussed in
- Xsection 3.
- X
- XThis experiment makes an interesting comment on the state of
- Xoptimizing compilers, the ability to write Fortran in any language,
- Xand the utility of the Livermore Loops as a vector benchmark.
- X
- XFor the purpose of this paper, Fortran refers to the Fortran language
- Xas described in the Fortran-77 standard\cite{ANSI77} and C refers to
- Xthe programming language as described in \cite{Kern78}. References to
- Xspecific dialects will be labeled as such.
- X
- X\section{Overview of Transliteration}
- X
- XIn converting the Livermore Loops from Fortran to C, the major goal
- Xwas to preserve the original Fortran semantics. Some changes were
- Xnecessary, including:
- X
- X\begin{itemize}
- X
- X\item The routine \tt SPACE \rm was deleted. Since the pointer
- Xassignment isn't used in the Fortran implementation, this has no
- Xeffect on the final results.
- X
- X\item The variable \tt lp \rm which controls the number of passes was
- Xmade an input parameter. The default is 100 passes. When the default
- Xis used, standard results are obtained.
- X
- X\item In kernel 15, arithmetic if statements were modified to
- Xif-then-else form, following the pattern sketched after this list.
- X
- X\item All occurrences of the form \tt x**2 \rm were replaced by \tt
- Xx*x.
- X
- X\item \tt REPORT \rm was truncated. Only the flop rates and averages
- Xare reported. Since the C version is only intended as a comparison to
- XFortran on the same machine, the analysis portion is not needed.
- X
- X\item \tt SIGNAL \rm was reimplemented as several separate routines to
- Xwork around differences in C and Fortran parameter passing. Further,
- X\tt VECTOR \rm was modified to call \tt lsignalN \rm for each array,
- Xsince C has no equivalent to a \tt COMMON \rm block. \tt SIGNAL \rm
- Xwas renamed to \tt lsignalN \rm where \tt N \rm is replaced by 0, 1,
- X2, or 3 to represent the number of dimensions in the array being
- Xinitialized. The separate routines are not strictly needed, but
- Xclarify the intent of the code.
- X
- XSince \tt SIGNAL \rm is not part of the measured code, this has no
- Xeffect on the reported performance.
- X
- X\item The number generation routine used by \tt SIGNAL \rm was
- Xcarefully reimplemented to duplicate Fortran numeric semantics.
- X
- X\item \tt SUMO \rm was reimplemented as \tt sumo, sumo2, \rm and \tt
- Xsumo3 \rm, depending on the number of dimensions in the array being
- Xsummed. As with \tt SIGNAL, \rm this was done for clarity and is in
- Xcode which is not part of that being measured.
- X
- X\item A command line option was added allowing only specified loops to
- Xbe run. This was done for debugging. The reported results are from a
- Xsingle run in which all loops were performed.
- X
- X\item All array references were converted from the 1 based subscript
- Xused by Fortran to 0 based subscripts as used by C. This was done to
- Xavoid holes in the C arrays which would have resulted in different
- Xmemory use patterns. This can have an occasional small adverse
- Ximpact on the C version, since some array references now involve
- Xsubtracting one from the indices.
- X
- X\end{itemize}
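- X
- XAs an illustration of the arithmetic if rewriting, the following
- Xsketch shows the general pattern in C. It is schematic only; the
- Xvariable \tt e \rm and the bodies are placeholders, not the actual
- Xkernel 15 code.
- X
- X\begin{verbatim}
- X
- X/* Fortran arithmetic if:   IF (E) 10, 20, 30                  */
- X/* branches to label 10, 20, or 30 as E is negative, zero, or  */
- X/* positive.  The transliteration uses an if-then-else chain:  */
- X
- Xif (e < 0.0) {
- X    /* statements reached through label 10 */
- X} else if (e == 0.0) {
- X    /* statements reached through label 20 */
- X} else {
- X    /* statements reached through label 30 */
- X}
- X
- X\end{verbatim}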
- X
- XA major change which wasn't made was transposing array indices.
- X
- X\section{Language Differences Encountered}
- X
- XThree major differences between Fortran and C were encountered which
- Xhave an impact on the transliteration.
- X
- X\begin{itemize}
- X
- X\item Different array index bases are used by the two languages.
- X\item Fortran uses \tt COMMON \rm blocks to allocate global data,
- Xwhile C merely requires declaring it.
- X\item C and Fortran have different rules for arithmetic precision. Some
- Xminor differences also exist.
- X
- X\end{itemize}
- X
- XEach of these will be discussed in turn.
- X
- XThe array index basis difference has two implications. One of these is
- Xthe way in which indices relate to memory locations in multiple
- Xdimension arrays, and the other is the difference in array basing.
- XC arrays are always indexed from 0 to \tt N - 1 \rm where \tt N \rm is
- Xthe number of elements declared for that index of the array. The
- Xdefault basing in Fortran is to index from 1 to \tt N, \rm although 0
- Xbasing is an option. All of the arrays in the Livermore Loops are 1
- Xbased. In the transliteration, this was explicitly adjusted for. In
- Xthe main, this consists of adjusting loop indices to run from 0 to \tt
- XN - 1 \rm rather than from 1 to \tt N \rm and subtracting 1 from
- Xconstant indices as part of the transliteration.
- X
- XThis can also be a problem when integer values are calculated by a
- Xprogram and then used as loop indices. An attempt was made to find
- Xall such places and convert them appropriately. In a few cases, the
- Xcalculations were left alone, and 1 was subtracted when the integer
- Xvariable was used as a subscript.
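- X
- XBoth adjustments are illustrated by the sketch below, which is
- Xschematic only; the array names and the computed subscript are
- Xplaceholders rather than code taken from the kernels.
- X
- X\begin{verbatim}
- X
- X/* Fortran original:                                            */
- X/*       DO 10 K = 1, N                                         */
- X/*  10   X(K) = Y(IX(K))                                        */
- X/*                                                              */
- X/* C transliteration: the loop index runs from 0 to N - 1, and  */
- X/* a subscript still computed in Fortran (1 based) terms has 1  */
- X/* subtracted at the point of use.                              */
- X
- Xfor (k = 0; k < n; k++)
- X    x[k] = y[ix[k] - 1];
- X
- X\end{verbatim}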
- X
- XThe other difference in arrays has to do with multiple dimension
- Xarrays. For example, the Fortran array
- X
- X\begin{verbatim}
- X DIMENSION A(2,2)
- X\end{verbatim}
- X
- X\noindent
- Xis stored in ascending memory addresses as \tt A(1,1), A(2,1), A(1,2),
- XA(2,2) \rm from low address to high address. The equivalent C array
- X
- X\begin{verbatim}
- X float a[2][2];
- X\end{verbatim}
- X
- X\noindent
- Xis stored as \tt a[0][0], a[0][1], a[1][0], a[1][1] \rm from low
- Xaddress to high address. A complete transliteration would transpose
- Xindices from left to right to preserve the actual memory ordering.
- XThis was not done in this transliteration. Assuming that the original
- XFortran had been carefully coded for memory access, this would cause a
- Xdegradation of performance of the C code on some systems.
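- X
- XFor illustration, a transposed transliteration of a simple
- Xtwo-dimensional update would look like the following sketch
- X(hypothetical arrays, not taken from the loops); the C inner loop
- Xthen walks the same consecutive memory locations as the Fortran
- Xinner loop does.
- X
- X\begin{verbatim}
- X
- X/* Fortran, inner loop over the first (fastest varying) index:  */
- X/*       DO 10 J = 1, N                                         */
- X/*       DO 10 I = 1, N                                         */
- X/*  10   A(I,J) = B(I,J) + C(I,J)                               */
- X/*                                                              */
- X/* A complete transliteration would swap the indices so that    */
- X/* the C inner loop also moves through consecutive addresses:   */
- X
- Xfor (j = 0; j < n; j++)
- X    for (i = 0; i < n; i++)
- X        a[j][i] = b[j][i] + c[j][i];
- X
- X\end{verbatim}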
- X
- XThe Livermore Loops utilize the fairly common Fortran practice of
- Xincluding a number of related items together in a \tt COMMON \rm
- Xblock in order to make global references to those items possible.
- X
- XFortran requires that items which are listed adjacently in a \tt
- XCOMMON \rm block be allocated adjacently in memory. One side effect
- Xof this requirement is that an entire \tt COMMON \rm block can be
- Xinitialized by treating it as a single array in an initialization
- Xroutine. Figure~1 shows an example in Fortran.
- X
- X\begin{figure}[ht]
- X\small
- X\caption{\tt COMMON \rm usage in Fortran}
- X\begin{verbatim}
- X
- XC
- XC THIS ROUTINE USES THREE ELEMENTS OF A COMMON BLOCK BY NAME
- XC
- X      SUBROUTINE AROUTINE
- X      COMMON /CBLOCK/ A1, B1, C1
- XC
- XC USE A1, B1, C1
- XC
- X      RETURN
- X      END
- XC
- XC THIS ROUTINE INITIALIZES THE COMMON BLOCK USING AN ARRAY
- XC
- X      SUBROUTINE INIT
- X      COMMON /CBLOCK/ A(3)
- X      DO 10 I = 1, 3
- X   10 A(I) = I
- X      RETURN
- X      END
- X
- X\end{verbatim}
- X\end{figure}
- X
- XThe initialization routines in the Livermore Loops take advantage of
- Xthis. There are several \tt COMMON \rm blocks, each of which contains
- Xmany arrays. There is an initialization routine which treats each \tt
- XCOMMON \rm block as a single array, and initializes it in a single
- Xloop.
- X
- XIt is possible to duplicate the adjacent storage effect of \tt COMMON
- X\rm blocks in C. Figure~2 shows an example of this usage.
- X
- X\begin{figure}[ht]
- X\small
- X\caption{Translating \tt COMMON \rm usage to C}
- X\begin{verbatim}
- X
- X#include <stdio.h>
- X
- Xstruct cblock {
- X float A_[3];
- X} CBLOCK;
- X
- X#define A1 CBLOCK.A_[0]
- X#define B1 CBLOCK.A_[1]
- X#define C1 CBLOCK.A_[2]
- X#define A CBLOCK.A_
- X
- Xvoid init()
- X{
- X int i;
- X for (i = 0; i < 3; i++)
- X A[i] = i;
- X}
- X
- Xmain()
- X{
- X init();
- X fprintf(stdout,"A1 = %f, B1 = %f, C1 = %f\n", A1, B1, C1);
- X}
- X
- X\end{verbatim}
- X\end{figure}
- X
- XThis technique can be extended within the \tt struct \rm to duplicate
- Xany naming which comes up within a \tt COMMON \rm block. In
- Xparticular, multiple arrays can be implemented by using either unions
- Xand preprocessor defines, if they do not overlap, or names and
- Xpreprocessor macro functions if they do overlap.
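- X
- XThe union variant is sketched below for a hypothetical block
- Xcontaining two arrays. The block and member names are invented for
- Xthe example, and the sketch assumes that the compiler inserts no
- Xpadding between the \tt float \rm members.
- X
- X\begin{verbatim}
- X
- X/* Hypothetical COMMON block:    COMMON /CBLOCK2/ P(2), Q(3)    */
- X/* The union allows the same storage to be viewed either as the */
- X/* individual arrays or as one flat array for initialization.   */
- X
- Xunion cblock2 {
- X    struct {
- X        float P_[2];
- X        float Q_[3];
- X    } m;
- X    float all[5];
- X} CBLOCK2;
- X
- X#define P   CBLOCK2.m.P_
- X#define Q   CBLOCK2.m.Q_
- X#define ALL CBLOCK2.all      /* whole block as a single array */
- X
- X\end{verbatim}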
- X
- XBecause the only use of the \tt COMMON \rm blocks within the Livermore
- XLoops is to provide global data and to simplify initialization and
- Xchecksum calculation, this approach was not taken. Instead, each
- Xvariable was declared as a C global, and the initialization routines
- Xwere modified to explicitly initialize each array separately. This
- Xhas an impact on both the initialization and checksum software.
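- X
- XA minimal sketch of this arrangement follows. The array names and
- Xlengths are placeholders, and the declaration shown for \tt lsignal1
- X\rm is an assumed form of the one-dimensional initializer described
- Xearlier.
- X
- X\begin{verbatim}
- X
- X/* Former COMMON members become ordinary C globals.             */
- Xdouble x[1001], y[1001], z[1001];
- X
- Xvoid lsignal1();             /* assumed per-array initializer  */
- X
- Xvoid init_vectors()
- X{
- X    /* one explicit call per array, replacing the single loop   */
- X    /* over a whole COMMON block used in the Fortran version    */
- X    lsignal1(x, 1001);
- X    lsignal1(y, 1001);
- X    lsignal1(z, 1001);
- X}
- X
- X\end{verbatim}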
- X
- XOn machines for which \tt double \rm provides more precision than \tt
- Xfloat \rm in C, numerical calculations are done differently in C than
- Xin Fortran. When all of the variables involved in a calculation are
- Xof the same precision, Fortran calculates all intermediate results to
- Xthat precision. C promotes all variables to the precision of \tt
- Xdouble, \rm does all calculations in double and converts to the
- Xprecision of the result. This yields different numerical results when
- Xthere are subexpressions, for instance in the case
- X
- X\begin{verbatim}
- Xresult = a * b - c * d
- X\end{verbatim}
- X
- XFortran would calculate the two products, convert to the appropriate
- Xprecision, perform the subtraction, and store the result. C would
- Xcalculate the two products in \tt double \rm precision, perform a \tt
- Xdouble \rm precision subtraction, and then convert the result. This
- Xcan yield quite different numbers.
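- X
- XThe effect can be reproduced directly. In the sketch below, which is
- Xillustrative only and uses arbitrary values, the casts force each
- Xproduct back to \tt float \rm precision before the subtraction,
- Xapproximating the Fortran evaluation, while the uncast expression
- Xfollows the default C promotion rules; the two stored results can
- Xdiffer in their low order digits.
- X
- X\begin{verbatim}
- X
- X#include <stdio.h>
- X
- Xmain()
- X{
- X    float a = 1.0e-4, b = 3.0, c = 1.0e-4, d = 2.0;
- X    float fortran_like, c_default;
- X
- X    /* each product rounded to float before the subtraction */
- X    fortran_like = (float)(a * b) - (float)(c * d);
- X
- X    /* products and subtraction carried out in double;      */
- X    /* only the final result is rounded to float            */
- X    c_default = a * b - c * d;
- X
- X    printf("%.10g  %.10g\n", fortran_like, c_default);
- X    return 0;
- X}
- X
- X\end{verbatim}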
- X
- XThe routine which generates the initial data was carefully coded to
- Xmimic Fortran conversion, so that the initial data would be the same
- Xin the C case as in the Fortran case, and care was taken to see that
- Xsubroutine calls were performed with the correct argument types. The
- Xtest routines themselves, however, calculate their results using the
- Xdefault C promotion rules.
- X
- X\section{Verifying the Transliteration}
- X
- XThe Convex compilers provide statement counting as a profiler option.
- XThe Fortran and C versions of the loops were run with no optimization
- Xand with statement profiling enabled. The statement counts were
- Xcompared on a statement by statement basis for the two programs. All
- Xloops were found to agree exactly.
- X
- XOnce the statement counts were completed, the two programs were run on
- Xthe Cray 2, which has the same precision for \tt float \rm and \tt
- Xdouble \rm in C. The Fortran version was compiled with CFT77 release
- X3.0 with all optimization disabled. The C version was compiled with
- Xthe UniCos 4.0 C compiler.
- X
- XLoops 1 through 5, 7, 11, 12, 13, 14, 16, 17, 20, and 24 produce
- Xidentical checksums to 12 decimal places, which is nearly the limit of
- Xfloating point precision on the Cray 2. Loops 6, 8, 9, 10, 15, 18,
- X21 and 23 all match to at least five decimal places. Careful
- Xinspection of the loops was made by running the Fortran and C versions
- Xsimultaneously under a source level debugger on a Silicon Graphics
- XIris 4D workstation. Several additional bugs were corrected, but no
- Xdifference in the numerical results was found. It is suspected that
- Xthe remaining differences are due to bugs in the way in which the
- Xchecksums are calculated under C.
- X
- X\section{Overview of Results}
- X
- XThe tables show the reported megaflop rates for the C and Fortran
- Xversions of the Livermore Loops on several machines available to the
- Xauthor. Of these, the Iris and the Sun were chosen as representative
- Xof the current generation of engineering workstations, the Amdahl as
- Xrepresentative of current mainframes, the Convex and Alliant as
- Xrepresentative of two different approaches to minisupercomputers, and
- Xthe Cray 2 as a representative supercomputer. All of the machines run
- Xrecent releases of the vendor's version of the Unix operating system.
- X
- XFor all machines, the overall performance of the C compiler was better
- Xthan the overall performance of the Fortran compiler. For the mainly
- XUnix systems, this was not surprising, but it was surprising on the
- XCray 2. A detailed comparison of the optimizations done by the two
- XCray compilers on loops which were significantly different is
- Xpresented below.
- X
- X\begin{table}[htbp]
- X\caption{Results in Megaflops: Alliant, Amdahl, and Cray 2}
- X\small
- X\begin{tabular}{|r|rr|rr|rrr|}
- X\hline
- X{\bf Manufacturer} & \multicolumn{2}{|c|}{Alliant} &
- X\multicolumn{2}{|c|}{Amdahl} & \multicolumn{3}{|c|}{Cray} \\
- X{\bf Model} & \multicolumn{2}{|c|}{FX/8} & \multicolumn{2}{|c|}{5880} &
- X\multicolumn{3}{|c|}{2} \\
- X{\bf Compiler} & cc & fortran & cc & f77 & cc (vector) & cft77 (vector) &
- Xcft77 (scalar) \\
- X\hline
- X1 & 10.72 & 15.02 & 15.02 & 1.88 & 764.77 & 96.89 & 13.53 \\
- X2 & 7.76 & 0.78 & 11.64 & 1.16 & 30.28 & 18.45 & 6.40 \\
- X3 & 7.51 & 1.00 & 15.02 & 1.72 & 21.39 & 69.57 & 11.29 \\
- X4 & 6.55 & 0.72 & 9.00 & 1.03 & 18.32 & 32.10 & 4.16 \\
- X5 & 7.06 & 0.71 & 12.00 & 1.09 & 20.36 & 9.85 & 10.67 \\
- X6 & 5.95 & 0.49 & 10.82 & 1.01 & 21.29 & 9.08 & 3.90 \\
- X7 & 15.92 & 19.10 & 18.73 & 2.51 & 929.28 & 147.31 & 16.35 \\
- X8 & 8.07 & 14.26 & 11.56 & 1.07 & 41.67 & 113.39 & 17.55 \\
- X9 & 14.72 & 10.30 & 25.75 & 1.72 & 748.91 & 113.07 & 13.88 \\
- X10 & 4.54 & 5.45 & 9.09 & 0.55 & 21.40 & 35.32 & 6.68 \\
- X11 & 4.29 & 0.43 & 7.50 & 0.75 & 11.00 & 9.02 & 7.57 \\
- X12 & 4.00 & 5.99 & 6.66 & 0.60 & 246.53 & 30.61 & 5.24 \\
- X13 & 1.68 & 0.27 & 4.48 & 0.24 & 6.44 & 1.75 & 1.66 \\
- X14 & 3.06 & 1.16 & 5.46 & 0.52 & 14.61 & 7.22 & 3.88 \\
- X15 & 5.29 & 2.54 & 5.86 & 0.63 & 23.05 & 6.94 & 6.66 \\
- X16 & 4.54 & 0.45 & 10.60 & 1.59 & 15.28 & 4.00 & 3.86 \\
- X17 & 7.79 & 1.82 & 13.63 & 2.73 & 36.74 & 8.08 & 7.81 \\
- X18 & 10.62 & 7.26 & 9.07 & 0.88 & 64.29 & 100.59 & 11.61 \\
- X19 & 7.27 & 0.73 & 12.12 & 1.21 & 27.08 & 12.40 & 11.34 \\
- X20 & 10.91 & 1.47 & 17.93 & 2.40 & 50.41 & 12.33 & 11.39 \\
- X21 & 7.13 & 1.82 & 6.07 & 0.60 & 15.16 & 23.43 & 6.13 \\
- X22 & 6.06 & 5.15 & 8.59 & 0.86 & 47.05 & 90.76 & 5.94 \\
- X23 & 7.10 & 0.93 & 9.33 & 0.91 & 30.69 & 9.41 & 8.99 \\
- X24 & 3.53 & 6.00 & 10.00 & 1.00 & 8.32 & 1.95 & 1.87 \\
- XHarmonic Mean & 5.59 & 1.09 & 9.37 & 0.87 & 22.38 & 9.26 & 5.7 \\
- X\hline
- X\end{tabular}
- X\end{table}
- X
- X
- X\begin{table}[htbp]
- X\caption{Results in Megaflops: Convex, Silicon Graphics, and Sun}
- X\small
- X\begin{tabular}{|r|rr|rr|rr|}
- X\hline
- X{\bf Manufacturer} & \multicolumn{2}{|c|}{Convex} &
- X\multicolumn{2}{|c|}{Silicon Graphics} & \multicolumn{2}{|c|}{Sun} \\
- X{\bf Model} & \multicolumn{2}{|c|}{C1} & \multicolumn{2}{|c|}{Iris 4D} &
- X\multicolumn{2}{|c|}{Sun 3/250} \\
- X{\bf Compiler} & vc & fc & Mips cc & Mips f77 & cc & f77 \\
- X\hline
- X1 & 1102.90 & 175.23 & 8.94 & 1.32 & 1.35 & 0.18 \\
- X2 & 8.49 & 1.57 & 7.76 & 1.29 & 1.11 & 0.12 \\
- X3 & 5.82 & 9.57 & 6.90 & 1.33 & 1.10 & 0.20 \\
- X4 & 7.37 & 1.34 & 6.00 & 0.71 & 0.99 & 0.11 \\
- X5 & 6.81 & 1.32 & 5.00 & 0.69 & 1.05 & 0.11 \\
- X6 & 4.16 & 3.80 & 5.67 & 0.86 & 0.97 & 0.12 \\
- X7 & 1811.50 & 331.52 & 13.38 & 1.75 & 1.93 & 0.18 \\
- X8 & 10.82 & 247.95 & 12.73 & 1.25 & 1.41 & 0.16 \\
- X9 & 1428.90 & 195.82 & 13.21 & 1.56 & 1.66 & 0.16 \\
- X10 & 2.57 & 7.15 & 8.26 & 0.91 & 0.74 & 0.09 \\
- X11 & 4.72 & 2.18 & 3.57 & 0.63 & 0.72 & 0.08 \\
- X12 & 2.55 & 16.34 & 3.57 & 0.62 & 0.75 & 0.08 \\
- X13 & 1.73 & 0.56 & 2.24 & 0.14 & 0.45 & 0.05 \\
- X14 & 4.24 & 1.34 & 4.27 & 0.37 & 0.73 & 0.08 \\
- X15 & 3.94 & 1.65 & 5.14 & 0.61 & 1.41 & 0.23 \\
- X16 & 5.18 & 0.88 & 7.57 & 0.88 & 1.18 & 0.17 \\
- X17 & 7.55 & 1.31 & 11.36 & 1.82 & 0.99 & 0.16 \\
- X18 & 8.50 & 11.38 & 9.99 & 1.12 & 1.35 & 0.16 \\
- X19 & 7.89 & 1.83 & 5.51 & 1.01 & 0.96 & 0.12 \\
- X20 & 7.17 & 1.60 & 10.83 & 1.42 & 1.72 & 0.25 \\
- X21 & 3.97 & 7.66 & 6.93 & 0.91 & 0.84 & 0.12 \\
- X22 & 5.60 & 101.18 & 6.87 & 0.90 & 1.94 & 0.32 \\
- X23 & 6.59 & 2.52 & 9.07 & 1.05 & 1.21 & 0.14 \\
- X24 & 90.46 & 9.16 & 4.35 & 0.63 & 1.13 & 0.14 \\
- XHarmonic Mean & 5.62 & 2.37 & 6.09 & 0.72 & 1.02 & 0.13 \\
- X\hline
- X\end{tabular}
- X\end{table}
- X
- X\section{Alliant Performance}
- X
- XThe Alliant Fortran compiler performs analysis to determine if a loop
- Xis vectorizable or can be executed concurrently on multiple
- Xprocessors. If the loop can be executed concurrently and in vector
- Xmode, it tries to find a way to use both vectorization and
- Xconcurrency simultaneously. The Alliant C compiler performs neither
- Xvectorization nor concurrency analysis and always produces scalar
- Xsingle processor code. The Alliant Fortran compiler suffers from
- Xtrying too hard. Sometimes it produces vector or concurrent code for
- Xwhich the overhead of the code exceeds the gain from the facility. In
- Xthese cases, such as loops 3, 9, and 13, the scalar C code performs
- Xmore efficiently than the optimized Fortran code. In other cases,
- Xsuch as loops 1 and 7, the Fortran code outperforms the C code. In no
- Xcase does the optimization give a speedup of more than 2 over the
- Xscalar code.
- X
- X\section{Amdahl Performance}
- X
- XThe Amdahl Fortran compiler is a straightforward port of the BSD Unix
- Xf77 compiler. It does not appear to have been tuned to the Amdahl
- Xarchitecture at all and generates very poor code. The Amdahl C
- Xcompiler is better tuned to the architecture, and in some cases Amdahl
- Xscalar code performance approaches that of the Cray 2. Because of the
- Xlarge difference between the compilers, a detailed comparison was not
- Xundertaken.
- X
- X\section{Cray 2 Performance}
- X
- XThere are four loops (1, 7, 9, and 12) for which C achieves
- X``impossible'' performance on the Cray 2, in that the reported rate
- Xexceeds the maximum performance of a single processor. These results
- Xare an artifact of the benchmark. C outperforms the Fortran code on
- Xmost scalar loops (5, 11, 13, 16, 17, 19, 20, and 24), with one
- Xexception (22). Fortran recognizes more loops as vectorizable than C,
- Xand produces better performance on the one loop (18) which both
- Xcompilers vectorize. Performance is mixed on those loops which
- XFortran vectorizes and C does not. In some cases (2, 6, 14, 15, and
- X23) the scalar C code is more efficient than the vector Fortran code.
- XIn the remaining cases (3, 4, 8, 10, and 21) the vector Fortran code
- Xis more efficient than the scalar C code.
- X
- XThe impossible performance of C is a result of an artifact of the
- Xbenchmarks. To obtain a large enough time interval for meaningful
- Xmeasurement, all of the loops are of the form:
- X
- X\begin{verbatim}
- X
- XFOR I = 1 TO NUMBER_OF_PASSES DO
- X BEGIN
- X do_the_loop_in_question
- X END
- X
- X\end{verbatim}
- X
- XIn the impossible cases, the loop in question is of a form similar to:
- X
- X\begin{verbatim}
- X
- XFOR J = 1 to ARRAY_LENGTH DO
- X BEGIN
- X X[J] := Y[J] op Z[J];
- X END
- X
- X\end{verbatim}
- X
- XIn this very simple case, it is possible for a reasonable compiler to
- Xdetermine that the entire inner loop is invariant with respect to the
- Xouter loop, move the inner loop outside the outer loop, and then throw
- Xthe outer loop away because it doesn't compute anything. If the
- Xvariable on the left side of the equation never appears on the right
- Xside, the loop is invariant and can be collapsed. Close inspection
- Xshows that the performance advantage of C on Loops 1, 7, 9, and 12 can
- Xbe explained as a result of this optimization of the outer loop. The
- XFortran compiler misses this optimization.
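- X
- XThe situation can be illustrated with a minimal C sketch of the
- Xtiming structure; the names and sizes are placeholders and the code
- Xis not taken from the benchmark source. Because nothing in the inner
- Xloop depends on the repetition counter, a sufficiently aggressive
- Xcompiler may hoist the inner loop and collapse the repetitions to a
- Xsingle pass, inflating the reported rate.
- X
- X\begin{verbatim}
- X
- X#define N  990               /* placeholder vector length      */
- X#define LP 100               /* placeholder number of passes   */
- X
- Xdouble x[N], y[N], z[N];
- X
- Xvoid timed_kernel()
- X{
- X    int l, k;
- X
- X    /* The outer loop exists only to lengthen the measured      */
- X    /* interval.  The inner loop writes x[] from y[] and z[]    */
- X    /* and never reads x[], so every pass computes the same     */
- X    /* values: the entire outer loop is invariant.              */
- X    for (l = 0; l < LP; l++)
- X        for (k = 0; k < N; k++)
- X            x[k] = y[k] * z[k];
- X}
- X
- X\end{verbatim}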
- X
- XThe original Fortran version of Loop 2 contains a compiler directive
- Xto force vectorization. The C version does not force vectorization,
- Xbut instead produces scalar code. Further, the vector length is 101,
- Xwhich is fairly short compared to the overhead introduced by setting
- Xup the vector loop. C produces faster code for this case, as a result
- Xof not vectorizing the loop. Loop 6 shows a similar phenomenon. In
- Xthis case, Fortran recognizes a conditional dependency and generates
- Xcode to use vector instructions when the dependency does not occur.
- XAgain the overhead is not justified and the C version is faster. This
- Xalso happens in loops 14 and 23.
- X
- XIn other cases, the Fortran compiler vectorizes and the C compiler
- Xdoes not, and the Fortran code runs faster than the C code. The
- Xperformance improvement varies between 1.5 and 3.2 times.
- X
- XIn all but one of the cases in which both languages generate scalar
- Xcode, C generates more efficient code than Fortran. The exception is
- Xloop 22. Fortran generates inline code for the \tt exp \rm
- Xfunction, while C generates calls to a subroutine. Although current C
- Xcompilers don't usually generate inline code, the ANSI draft standard
- Xprovides a mechanism for defining and implementing inlines, and some
- Xcompilers do support the mechanism.
- X
- XOn the Cray 2, C generates shorter, less convoluted loops and uses more
- Xregister\footnote{Actually, local memory, which C treats as a large
- Xregister file.} references than Fortran. Overall, this appears to
- Xgive C a scalar performance advantage over Fortran. Compiling the
- XFortran loops with vectorization disabled shows C scalar performance
- Xto vary from 2 to 5 times faster than Fortran scalar performance.
- X
- X\section{Convex Performance}
- X
- XThe Convex compilers produce considerable documentation of their
- Xactions. Figure~3 shows part of the vectorization summary from the C
- Xcompiler for the file containing the loops. The Convex compiler
- Xrecognizes a wide range of language constructs, including goto-driven
- Xloops, as potential candidates for vectorization. It is able to
- Xrecognize in some cases that it should not vectorize a loop, and it is
- Xfrequently able to give good information about why a loop cannot be
- Xvectorized.
- X
- X\begin{figure}[ht]
- X\small
- X\caption{Vectorization using Convex C}
- X\begin{verbatim}
- X
- XSource Iter. Vector-
- XLine Var. Start Stop Step ization Reason
- X---------------------------------------------------------------------------
- X 42 k *EXPR* ipntp 2 None Not enough vector code
- X 53 k *EXPR* n 1 PART Vector (50%)
- X 99 ky *EXPR* n 1 FULL Vector
- X 149 l *EXPR* lp 1 None Unable to distribute
- X 167 ip *EXPR* n 1 PART Vector (92%)
- X 213 l *EXPR* lp 1 None Not trip counted
- X 266 k *EXPR* n 1 None Inner loop: exit
- X 329 l *EXPR* lp 1 None Inner loop: unknown trips
- X
- X\end{verbatim}
- X\end{figure}
- X
- XBoth the Convex C and Fortran compilers recognized the same loops as
- Xvectorizable, for the same reasons, except for loop 22, which only
- XFortran recognized, and loop 24, which only C recognized. The
- Xcompiler which recognized a loop as vectorizable produced much faster
- Xcode. If both compilers
- Xproduced scalar code or both produced vector code, the C compiler
- Xproduced more efficient code. The C compiler appears to do a better
- Xjob of managing register allocation and memory references and so
- Xproduces code with less memory bandwidth requirements.
- X
- X\section{Silicon Graphics Performance}
- X
- XThe Silicon Graphics Iris 4D uses a MIPS R2000\cite{Kane87} RISC
- Xprocessor chip set, with a floating point accelerator. RISC
- Xarchitecture machines depend heavily on compiler technology to
- Xgenerate efficient code in all cases, and it is not surprising that
- Xthe MIPS machine performs well. That Fortran did not fare as well as
- XC can be attributed to the primary marketplace of the Iris, which uses
- XC as its major programming language. It appears that more effort has
- Xgone into the C compiler than has gone into the Fortran compiler.
- X
- X\section{Sun 3 Performance}
- X
- XThe Sun 3 numbers show a classic example of a difference in code
- Xgeneration quality. Sun has spent a lot of time on the C compiler,
- Xwhich is the mainstay of their product line, and not as much time on the
- XFortran compiler. As a result, the C compiler produces more efficient
- Xcode than the Fortran compiler in all cases.
- X
- X\section{Conclusions}
- X
- XThese results demonstrate that on certain machines, using certain
- Xcompiler implementations, it is possible for a C compiler to produce
- Xmore efficient code than a Fortran compiler for a specific numerical
- Xapplication. Which compiler generates more efficient code appears to
- Xdepend more on the maturity of the compiler than on the nature of the
- Xmachine. From the way in which the loops were transliterated, it is
- Xsafe to say that it is possible to write C code which can be easily
- Xoptimized, just as it is possible to write Fortran code which can be
- Xeasily optimized.
- X
- XThis is not to say that C has all of the features which make Fortran a
- Xgood choice for numerical calculation. An informal poll of numerical
- Xprogrammers who are familiar with both Fortran and C produced a
- Xpartial list of features seen to be important which are not present in
- Xthe C language. The list included the integer exponentiation operator,
- Xautomatic dimensioning of parameter arrays, complex numbers and
- Xthe repeat count in formatted i/o. Dynamic memory allocation and
- Xpointers were viewed as an advantage of C over Fortran.
- X
- XCurrent compiler technology is making it more difficult to develop
- Xreasonable benchmarks which measure what they intend to measure.
- XAlthough the Livermore Loops have held up well over time, they are
- Xstarting to fall to optimizing compilers. In particular, the ability
- Xto detect invariance of the outer loop has led to invalid results
- Xbeing returned for particular loops.
- X
- XAs McMahon pointed out, performance can be as much a function of the
- Xmaturity of compiler technology as of the machine on which the code
- Xruns. This is as true for codes written in C as it is for Fortran.
- XIt is also true for the relative performance of various compilers from
- Xthe same vendor on a given machine.
- X
- XIt is possible to ``overoptimize.'' Vector operations include some
- Xoverhead, such as pipeline start up, which must be accounted for by
- Xthe optimizer. Concurrent operations also introduce overhead. Some
- Xcompilers are more sophisticated about this than others. The Convex
- Xcompiler will not vectorize a loop if it believes that there is
- Xinsufficient code to benefit.
- X
- XSome vendors appear to have paid more attention to vector optimization
- Xin their Fortran compilers than to scalar, although many of these
- Xloops could benefit from scalar optimization. This is most apparent
- Xin the difference between C and Fortran scalar performance on the Cray 2.
- X
- XIt is possible to write numerical codes in C and have them compiled
- Xinto efficient code which produces correct results. Any code which can
- Xbe written in Fortran can also be written in C to produce the same
- Xresults, and a good C compiler can generate efficient code from that
- Xsource.
- X
- X\bibliography{unix,misc}
- X\bibliographystyle{plain}
- X
- X\end{document}
- END_OF_FILE
- if test 28304 -ne `wc -c <'cloops.tex'`; then
- echo shar: \"'cloops.tex'\" unpacked with wrong size!
- fi
- # end of 'cloops.tex'
- fi
- echo shar: End of archive 3 \(of 3\).
- cp /dev/null ark3isdone
- MISSING=""
- for I in 1 2 3 ; do
- if test ! -f ark${I}isdone ; then
- MISSING="${MISSING} ${I}"
- fi
- done
- if test "${MISSING}" = "" ; then
- echo You have unpacked all 3 archives.
- rm -f ark[1-9]isdone
- else
- echo You still need to unpack the following archives:
- echo " " ${MISSING}
- fi
- ## End of shell archive.
- exit 0
-
- exit 0 # Just in case...
-